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Abstract 

Bigtable is a distributed storage system for managing 
structured data that is designed to scale to a very large 
size: petabytes of data across thousands of commodity 
servers. Many projects at Google store data in Bigtable, 
including web indexing, Google Earth, and Google Fi¬ 
nance. These applications place very different demands 
on Bigtable, both in terms of data size (from URLs to 
web pages to satellite imagery) and latency requirements 
(from backend bulk processing to real-time data serving). 
Despite these varied demands, Bigtable has successfully 
provided a flexible, high-performance solution for all of 
these Google products. In this paper we describe the sim¬ 
ple data model provided by Bigtable, which gives clients 
dynamic control over data layout and format, and we de¬ 
scribe the design and implementation of Bigtable. 

1 Introduction 

Over the last two and a half years we have designed, 
implemented, and deployed a distributed storage system 
for managing structured data at Google called Bigtable. 
Bigtable is designed to reliably scale to petabytes of 
data and thousands of machines. Bigtable has achieved 
several goals: wide applicability, scalability, high per¬ 
formance, and high availability. Bigtable is used by 
more than sixty Google products and projects, includ¬ 
ing Google Analytics, Google Finance, Orkut, Person¬ 
alized Search, Writely, and Google Earth. These prod¬ 
ucts use Bigtable for a variety of demanding workloads, 
which range from throughput-oriented batch-processing 
jobs to latency-sensitive serving of data to end users. 
The Bigtable clusters used by these products span a wide 
range of configurations, from a handful to thousands of 
servers, and store up to several hundred terabytes of data. 

In many ways, Bigtable resembles a database: it shares 
many implementation strategies with databases. Paral¬ 
lel databases [14] and main-memory databases [13] have 


achieved scalability and high performance, but Bigtable 
provides a different interface than such systems. Bigtable 
does not support a full relational data model; instead, it 
provides clients with a simple data model that supports 
dynamic control over data layout and format, and al¬ 
lows clients to reason about the locality properties of the 
data represented in the underlying storage. Data is in¬ 
dexed using row and column names that can be arbitrary 
strings. Bigtable also treats data as uninterpreted strings, 
although clients often serialize various forms of struc¬ 
tured and semi-structured data into these strings. Clients 
can control the locality of their data through careful 
choices in their schemas. Finally, Bigtable schema pa¬ 
rameters let clients dynamically control whether to serve 
data out of memory or from disk. 

Section 2 describes the data model in more detail, and 
Section 3 provides an overview of the client API. Sec¬ 
tion 4 briefly describes the underlying Google infrastruc¬ 
ture on which Bigtable depends. Section 5 describes the 
fundamentals of the Bigtable implementation, and Sec¬ 
tion 6 describes some of the refinements that we made 
to improve Bigtable’s performance. Section 7 provides 
measurements of Bigtable’s performance. We describe 
several examples of how Bigtable is used at Google 
in Section 8, and discuss some lessons we learned in 
designing and supporting Bigtable in Section 9. Fi¬ 
nally, Section 10 describes related work, and Section 11 
presents our conclusions. 


2 Data Model 

A Bigtable is a sparse, distributed, persistent multi¬ 
dimensional sorted map. The map is indexed by a row 
key, column key, and a timestamp; each value in the map 
is an uninterpreted array of bytes. 

(row:string, column:string, time:int64) —> string 
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"contents:" "anchor:cnnsi.com" "anchor:my.look.ca" 



Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family con¬ 
tains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN's home page 
is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchoncnnsi.com 
and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t 3 , t$, and f6. 


We settled on this data model after examining a variety 
of potential uses of a Bigtable-like system. As one con¬ 
crete example that drove some of our design decisions, 
suppose we want to keep a copy of a large collection of 
web pages and related information that could be used by 
many different projects; let us call this particular table 
the Web table. In Webtable, we would use URLs as row 
keys, various aspects of web pages as column names, and 
store the contents of the web pages in the contents: col¬ 
umn under the timestamps when they were fetched, as 
illustrated in Figure 1. 


Rows 

The row keys in a table are arbitrary strings (currently up 
to 64KB in size, although 10-100 bytes is a typical size 
for most of our users). Every read or write of data under 
a single row key is atomic (regardless of the number of 
different columns being read or written in the row), a 
design decision that makes it easier for clients to reason 
about the system’s behavior in the presence of concurrent 
updates to the same row. 

Bigtable maintains data in lexicographic order by row 
key. The row range for a table is dynamically partitioned. 
Each row range is called a tablet, which is the unit of dis¬ 
tribution and load balancing. As a result, reads of short 
row ranges are efficient and typically require communi¬ 
cation with only a small number of machines. Clients 
can exploit this property by selecting their row keys so 
that they get good locality for their data accesses. For 
example, in Webtable, pages in the same domain are 
grouped together into contiguous rows by reversing the 
hostname components of the URLs. For example, we 
store data for maps.google.com/index.html under the 
key com.google.maps/index.html. Storing pages from 
the same domain near each other makes some host and 
domain analyses more efficient. 


Column Families 

Column keys are grouped into sets called column fami¬ 
lies, which form the basic unit of access control. All data 
stored in a column family is usually of the same type (we 
compress data in the same column family together). A 
column family must be created before data can be stored 
under any column key in that family; after a family has 
been created, any column key within the family can be 
used. It is our intent that the number of distinct column 
families in a table be small (in the hundreds at most), and 
that families rarely change during operation. In contrast, 
a table may have an unbounded number of columns. 

A column key is named using the following syntax: 
family-.qualifier. Column family names must be print¬ 
able, but qualifiers may be arbitrary strings. An exam¬ 
ple column family for the Webtable is language, which 
stores the language in which a web page was written. We 
use only one column key in the language family, and it 
stores each web page’s language ID. Another useful col¬ 
umn family for this table is anchor; each column key in 
this family represents a single anchor, as shown in Fig¬ 
ure 1. The qualifier is the name of the referring site; the 
cell contents is the link text. 

Access control and both disk and memory account¬ 
ing are performed at the column-family level. In our 
Webtable example, these controls allow us to manage 
several different types of applications: some that add new 
base data, some that read the base data and create derived 
column families, and some that are only allowed to view 
existing data (and possibly not even to view all of the 
existing families for privacy reasons). 

Timestamps 

Each cell in a Bigtable can contain multiple versions of 
the same data; these versions are indexed by timestamp. 
Bigtable timestamps are 64-bit integers. They can be as¬ 
signed by Bigtable, in which case they represent “real 
time” in microseconds, or be explicitly assigned by client 
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// Open the table 

Table *T = OpenOrDie("/bigtable/web/webtable"); 

// Write a new anchor and delete an old anchor 
RowMutation rl (T, "com.cnn.www"); 
rl.Set("anchor:www.c-span.org ", "CNN"); 
rl.Delete ("anchor:www.abc.com"); 

Operation op; 

Apply(&op, &rl); 

Figure 2: Writing to Bigtable. 

applications. Applications that need to avoid collisions 
must generate unique timestamps themselves. Different 
versions of a cell are stored in decreasing timestamp or¬ 
der, so that the most recent versions can be read first. 

To make the management of versioned data less oner¬ 
ous, we support two per-column-family settings that tell 
Bigtable to garbage-collect cell versions automatically. 
The client can specify either that only the last n versions 
of a cell be kept, or that only new-enough versions be 
kept (e.g., only keep values that were written in the last 
seven days). 

In our Webtable example, we set the timestamps of 
the crawled pages stored in the contents: column to 
the times at which these page versions were actually 
crawled. The garbage-collection mechanism described 
above lets us keep only the most recent three versions of 
every page. 

3 API 

The Bigtable API provides functions for creating and 
deleting tables and column families. It also provides 
functions for changing cluster, table, and column family 
metadata, such as access control rights. 

Client applications can write or delete values in 
Bigtable, look up values from individual rows, or iter¬ 
ate over a subset of the data in a table. Figure 2 shows 
C++ code that uses a RowMutation abstraction to per¬ 
form a series of updates. (Irrelevant details were elided 
to keep the example short.) The call to Apply performs 
an atomic mutation to the Webtable: it adds one anchor 
to www.cnn.com and deletes a different anchor. 

Figure 3 shows C++ code that uses a Scanner ab¬ 
straction to iterate over all anchors in a particular row. 
Clients can iterate over multiple column families, and 
there are several mechanisms for limiting the rows, 
columns, and timestamps produced by a scan. For ex¬ 
ample, we could restrict the scan above to only produce 
anchors whose columns match the regular expression 
anchor:*.cnn.com, or to only produce anchors whose 
timestamps fall within ten days of the current time. 


Scanner scanner(T); 

ScanStream *stream; 

stream = scanner.FetchColumnFamily("anchor"); 
stream->SetReturnAHVersions () ; 
scanner.Lookup("com.cnn.www"); 
for (; !stream->Done(); stream->Next()) { 

printf("%s %s %lld %s\n", 
scanner.RowName(), 
stream->ColumnName(), 
stream->MicroTimestamp(), 
stream->Value()); 

} 

Figure 3: Reading from Bigtable. 

Bigtable supports several other features that allow the 
user to manipulate data in more complex ways. First, 
Bigtable supports single-row transactions, which can be 
used to perform atomic read-modify-write sequences on 
data stored under a single row key. Bigtable does not cur¬ 
rently support general transactions across row keys, al¬ 
though it provides an interface for batching writes across 
row keys at the clients. Second, Bigtable allows cells 
to be used as integer counters. Finally, Bigtable sup¬ 
ports the execution of client-supplied scripts in the ad¬ 
dress spaces of the servers. The scripts are written in a 
language developed at Google for processing data called 
Sawzall [28]. At the moment, our Sawzall-based API 
does not allow client scripts to write back into Bigtable, 
but it does allow various forms of data transformation, 
filtering based on arbitrary expressions, and summariza¬ 
tion via a variety of operators. 

Bigtable can be used with MapReduce [12], a frame¬ 
work for running large-scale parallel computations de¬ 
veloped at Google. We have written a set of wrappers 
that allow a Bigtable to be used both as an input source 
and as an output target for MapReduce jobs. 

4 Building Blocks 

Bigtable is built on several other pieces of Google in¬ 
frastructure. Bigtable uses the distributed Google File 
System (GFS) [17] to store log and data files. A Bigtable 
cluster typically operates in a shared pool of machines 
that run a wide variety of other distributed applications, 
and Bigtable processes often share the same machines 
with processes from other applications. Bigtable de¬ 
pends on a cluster management system for scheduling 
jobs, managing resources on shared machines, dealing 
with machine failures, and monitoring machine status. 

The Google SSTable file format is used internally to 
store Bigtable data. An SSTable provides a persistent, 
ordered immutable map from keys to values, where both 
keys and values are arbitrary byte strings. Operations are 
provided to look up the value associated with a specified 
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key, and to iterate over all key/value pairs in a specified 
key range. Internally, each SSTable contains a sequence 
of blocks (typically each block is 64KB in size, but this 
is configurable). A block index (stored at the end of the 
SSTable) is used to locate blocks; the index is loaded 
into memory when the SSTable is opened. A lookup 
can be performed with a single disk seek: we first find 
the appropriate block by performing a binary search in 
the in-memory index, and then reading the appropriate 
block from disk. Optionally, an SSTable can be com¬ 
pletely mapped into memory, which allows us to perform 
lookups and scans without touching disk. 

Bigtable relies on a highly-available and persistent 
distributed lock service called Chubby [8], A Chubby 
service consists of five active replicas, one of which is 
elected to be the master and actively serve requests. The 
service is live when a majority of the replicas are running 
and can communicate with each other. Chubby uses the 
Paxos algorithm [9, 23] to keep its replicas consistent in 
the face of failure. Chubby provides a namespace that 
consists of directories and small files. Each directory or 
file can be used as a lock, and reads and writes to a file 
are atomic. The Chubby client library provides consis¬ 
tent caching of Chubby files. Each Chubby client main¬ 
tains a session with a Chubby service. A client’s session 
expires if it is unable to renew its session lease within the 
lease expiration time. When a client’s session expires, it 
loses any locks and open handles. Chubby clients can 
also register callbacks on Chubby files and directories 
for notification of changes or session expiration. 

Bigtable uses Chubby for a variety of tasks; to ensure 
that there is at most one active master at any time; to 
store the bootstrap location of Bigtable data (see Sec¬ 
tion 5.1); to discover tablet servers and finalize tablet 
server deaths (see Section 5.2); to store Bigtable schema 
information (the column family information for each ta¬ 
ble); and to store access control lists. If Chubby becomes 
unavailable for an extended period of time, Bigtable be¬ 
comes unavailable. We recently measured this effect 
in 14 Bigtable clusters spanning 11 Chubby instances. 
The average percentage of Bigtable server hours during 
which some data stored in Bigtable was not available due 
to Chubby unavailability (caused by either Chubby out¬ 
ages or network issues) was 0.0047%. The percentage 
for the single cluster that was most affected by Chubby 
unavailability was 0.0326%. 


5 Implementation 

The Bigtable implementation has three major compo¬ 
nents: a library that is linked into every client, one mas¬ 
ter server, and many tablet servers. Tablet servers can be 


dynamically added (or removed) from a cluster to acco¬ 
modate changes in workloads. 

The master is responsible for assigning tablets to tablet 
servers, detecting the addition and expiration of tablet 
servers, balancing tablet-server load, and garbage col¬ 
lection of files in GFS. In addition, it handles schema 
changes such as table and column family creations. 

Each tablet server manages a set of tablets (typically 
we have somewhere between ten to a thousand tablets per 
tablet server). The tablet server handles read and write 
requests to the tablets that it has loaded, and also splits 
tablets that have grown too large. 

As with many single-master distributed storage sys¬ 
tems [17, 21], client data does not move through the mas¬ 
ter: clients communicate directly with tablet servers for 
reads and writes. Because Bigtable clients do not rely on 
the master for tablet location information, most clients 
never communicate with the master. As a result, the mas¬ 
ter is lightly loaded in practice. 

A Bigtable cluster stores a number of tables. Each ta¬ 
ble consists of a set of tablets, and each tablet contains 
all data associated with a row range. Initially, each table 
consists of just one tablet. As a table grows, it is auto¬ 
matically split into multiple tablets, each approximately 
100-200 MB in size by default. 

5.1 Tablet Location 

We use a three-level hierarchy analogous to that of a B + - 
tree [10] to store tablet location information (Figure 4). 


UserTablel 



Figure 4: Tablet location hierarchy. 


The first level is a file stored in Chubby that contains 
the location of the root tablet. The root tablet contains 
the location of all tablets in a special METADATA table. 
Each METADATA tablet contains the location of a set of 
user tablets. The root tablet is just the first tablet in the 
METADATA table, but is treated specially—it is never 
split—to ensure that the tablet location hierarchy has no 
more than three levels. 

The METADATA table stores the location of a tablet 
under a row key that is an encoding of the tablet’s table 
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identifier and its end row. Each METADATA row stores 
approximately 1KB of data in memory. With a modest 
limit of 128 MB METADATA tablets, our three-level lo¬ 
cation scheme is sufficient to address 2 34 tablets (or 2 61 
bytes in 128 MB tablets). 

The client library caches tablet locations. If the client 
does not know the location of a tablet, or if it discov¬ 
ers that cached location information is incorrect, then 
it recursively moves up the tablet location hierarchy. 
If the client’s cache is empty, the location algorithm 
requires three network round-trips, including one read 
from Chubby. If the client’s cache is stale, the location 
algorithm could take up to six round-trips, because stale 
cache entries are only discovered upon misses (assuming 
that METADATA tablets do not move very frequently). 
Although tablet locations are stored in memory, so no 
GFS accesses are required, we further reduce this cost 
in the common case by having the client library prefetch 
tablet locations: it reads the metadata for more than one 
tablet whenever it reads the METADATA table. 

We also store secondary information in the 
METADATA table, including a log of all events per¬ 
taining to each tablet (such as when a server begins 
serving it). This information is helpful for debugging 
and performance analysis. 

5.2 Tablet Assignment 

Each tablet is assigned to one tablet server at a time. The 
master keeps track of the set of live tablet servers, and 
the current assignment of tablets to tablet servers, in¬ 
cluding which tablets are unassigned. When a tablet is 
unassigned, and a tablet server with sufficient room for 
the tablet is available, the master assigns the tablet by 
sending a tablet load request to the tablet server. 

Bigtable uses Chubby to keep track of tablet servers. 
When a tablet server starts, it creates, and acquires an 
exclusive lock on, a uniquely-named file in a specific 
Chubby directory. The master monitors this directory 
(the servers directory ) to discover tablet servers. A tablet 
server stops serving its tablets if it loses its exclusive 
lock: e.g., due to a network partition that caused the 
server to lose its Chubby session. (Chubby provides an 
efficient mechanism that allows a tablet server to check 
whether it still holds its lock without incurring network 
traffic.) A tablet server will attempt to reacquire an ex¬ 
clusive lock on its file as long as the file still exists. If the 
file no longer exists, then the tablet server will never be 
able to serve again, so it kills itself. Whenever a tablet 
server terminates (e.g., because the cluster management 
system is removing the tablet server’s machine from the 
cluster), it attempts to release its lock so that the master 
will reassign its tablets more quickly. 


The master is responsible for detecting when a tablet 
server is no longer serving its tablets, and for reassign¬ 
ing those tablets as soon as possible. To detect when a 
tablet server is no longer serving its tablets, the master 
periodically asks each tablet server for the status of its 
lock. If a tablet server reports that it has lost its lock, 
or if the master was unable to reach a server during its 
last several attempts, the master attempts to acquire an 
exclusive lock on the server’s file. If the master is able to 
acquire the lock, then Chubby is live and the tablet server 
is either dead or having trouble reaching Chubby, so the 
master ensures that the tablet server can never serve again 
by deleting its server file. Once a server’s file has been 
deleted, the master can move all the tablets that were pre¬ 
viously assigned to that server into the set of unassigned 
tablets. To ensure that a Bigtable cluster is not vulnera¬ 
ble to networking issues between the master and Chubby, 
the master kills itself if its Chubby session expires. How¬ 
ever, as described above, master failures do not change 
the assignment of tablets to tablet servers. 

When a master is started by the cluster management 
system, it needs to discover the current tablet assign¬ 
ments before it can change them. The master executes 
the following steps at startup. (1) The master grabs 
a unique master lock in Chubby, which prevents con¬ 
current master instantiations. (2) The master scans the 
servers directory in Chubby to find the live servers. 
(3) The master communicates with every live tablet 
server to discover what tablets are already assigned to 
each server. (4) The master scans the METADATA table 
to learn the set of tablets. Whenever this scan encounters 
a tablet that is not already assigned, the master adds the 
tablet to the set of unassigned tablets, which makes the 
tablet eligible for tablet assignment. 

One complication is that the scan of the METADATA 
table cannot happen until the METADATA tablets have 
been assigned. Therefore, before starting this scan (step 
4), the master adds the root tablet to the set of unassigned 
tablets if an assignment for the root tablet was not dis¬ 
covered during step 3. This addition ensures that the root 
tablet will be assigned. Because the root tablet contains 
the names of all METADATA tablets, the master knows 
about all of them after it has scanned the root tablet. 

The set of existing tablets only changes when a ta¬ 
ble is created or deleted, two existing tablets are merged 
to form one larger tablet, or an existing tablet is split 
into two smaller tablets. The master is able to keep 
track of these changes because it initiates all but the last. 
Tablet splits are treated specially since they are initi¬ 
ated by a tablet server. The tablet server commits the 
split by recording information for the new tablet in the 
METADATA table. When the split has committed, it noti¬ 
fies the master. In case the split notification is lost (either 
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because the tablet server or the master died), the master 
detects the new tablet when it asks a tablet server to load 
the tablet that has now split. The tablet server will notify 
the master of the split, because the tablet entry it finds in 
the METADATA table will specify only a portion of the 
tablet that the master asked it to load. 

5.3 Tablet Serving 

The persistent state of a tablet is stored in GFS, as illus¬ 
trated in Figure 5. Updates are committed to a commit 
log that stores redo records. Of these updates, the re¬ 
cently committed ones are stored in memory in a sorted 
buffer called a memtable; the older updates are stored in a 
sequence of SSTables. To recover a tablet, a tablet server 



Figure 5: Tablet Representation 

reads its metadata from the METADATA table. This meta¬ 
data contains the list of SSTables that comprise a tablet 
and a set of a redo points, which are pointers into any 
commit logs that may contain data for the tablet. The 
server reads the indices of the SSTables into memory and 
reconstructs the memtable by applying all of the updates 
that have committed since the redo points. 

When a write operation arrives at a tablet server, the 
server checks that it is well-formed, and that the sender 
is authorized to perform the mutation. Authorization is 
performed by reading the list of permitted writers from a 
Chubby file (which is almost always a hit in the Chubby 
client cache). A valid mutation is written to the commit 
log. Group commit is used to improve the throughput of 
lots of small mutations [13, 16]. After the write has been 
committed, its contents are inserted into the memtable. 

When a read operation arrives at a tablet server, it is 
similarly checked for well-formedness and proper autho¬ 
rization. A valid read operation is executed on a merged 
view of the sequence of SSTables and the memtable. 
Since the SSTables and the memtable are lexicograph¬ 
ically sorted data structures, the merged view can be 
formed efficiently. 

Incoming read and write operations can continue 
while tablets are split and merged. 


5.4 Compactions 

As write operations execute, the size of the memtable in¬ 
creases. When the memtable size reaches a threshold, the 
memtable is frozen, a new memtable is created, and the 
frozen memtable is converted to an SSTable and written 
to GFS. This minor compaction process has two goals: 
it shrinks the memory usage of the tablet server, and it 
reduces the amount of data that has to be read from the 
commit log during recovery if this server dies. Incom¬ 
ing read and write operations can continue while com¬ 
pactions occur. 

Every minor compaction creates a new SSTable. If this 
behavior continued unchecked, read operations might 
need to merge updates from an arbitrary number of 
SSTables. Instead, we bound the number of such files 
by periodically executing a merging compaction in the 
background. A merging compaction reads the contents 
of a few SSTables and the memtable, and writes out a 
new SSTable. The input SSTables and memtable can be 
discarded as soon as the compaction has finished. 

A merging compaction that rewrites all SSTables 
into exactly one SSTable is called a major compaction. 
SSTables produced by non-major compactions can con¬ 
tain special deletion entries that suppress deleted data in 
older SSTables that are still live. A major compaction, 
on the other hand, produces an SSTable that contains 
no deletion information or deleted data. Bigtable cy¬ 
cles through all of its tablets and regularly applies major 
compactions to them. These major compactions allow 
Bigtable to reclaim resources used by deleted data, and 
also allow it to ensure that deleted data disappears from 
the system in a timely fashion, which is important for 
services that store sensitive data. 

6 Refinements 

The implementation described in the previous section 
required a number of refinements to achieve the high 
performance, availability, and reliability required by our 
users. This section describes portions of the implementa¬ 
tion in more detail in order to highlight these refinements. 

Locality groups 

Clients can group multiple column families together into 
a locality group. A separate SSTable is generated for 
each locality group in each tablet. Segregating column 
families that are not typically accessed together into sep¬ 
arate locality groups enables more efficient reads. For 
example, page metadata in Webtable (such as language 
and checksums) can be in one locality group, and the 
contents of the page can be in a different group: an ap- 
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plication that wants to read the metadata does not need 
to read through all of the page contents. 

In addition, some useful tuning parameters can be 
specified on a per-locality group basis. For example, a lo¬ 
cality group can be declared to be in-memory. SSTables 
for in-memory locality groups are loaded lazily into the 
memory of the tablet server. Once loaded, column fam¬ 
ilies that belong to such locality groups can be read 
without accessing the disk. This feature is useful for 
small pieces of data that are accessed frequently: we 
use it internally for the location column family in the 
METADATA table. 


Compression 

Clients can control whether or not the SSTables for a 
locality group are compressed, and if so, which com¬ 
pression format is used. The user-specified compres¬ 
sion format is applied to each SSTable block (whose size 
is controllable via a locality group specific tuning pa¬ 
rameter). Although we lose some space by compress¬ 
ing each block separately, we benefit in that small por¬ 
tions of an SSTable can be read without decompress¬ 
ing the entire file. Many clients use a two-pass custom 
compression scheme. The first pass uses Bentley and 
Mcllroy’s scheme [6], which compresses long common 
strings across a large window. The second pass uses a 
fast compression algorithm that looks for repetitions in 
a small 16 KB window of the data. Both compression 
passes are very fast—they encode at 100-200 MB/s, and 
decode at 400-1000 MB/s on modern machines. 

Even though we emphasized speed instead of space re¬ 
duction when choosing our compression algorithms, this 
two-pass compression scheme does surprisingly well. 
For example, in Webtable, we use this compression 
scheme to store Web page contents. In one experiment, 
we stored a large number of documents in a compressed 
locality group. For the purposes of the experiment, we 
limited ourselves to one version of each document in¬ 
stead of storing all versions available to us. The scheme 
achieved a 10-to-l reduction in space. This is much 
better than typical Gzip reductions of 3-to-l or 4-to-l 
on HTML pages because of the way Webtable rows are 
laid out: all pages from a single host are stored close 
to each other. This allows the Bentley-Mcllroy algo¬ 
rithm to identify large amounts of shared boilerplate in 
pages from the same host. Many applications, not just 
Webtable, choose their row names so that similar data 
ends up clustered, and therefore achieve very good com¬ 
pression ratios. Compression ratios get even better when 
we store multiple versions of the same value in Bigtable. 


Caching for read performance 

To improve read performance, tablet servers use two lev¬ 
els of caching. The Scan Cache is a higher-level cache 
that caches the key-value pairs returned by the SSTable 
interface to the tablet server code. The Block Cache is a 
lower-level cache that caches SSTables blocks that were 
read from GFS. The Scan Cache is most useful for appli¬ 
cations that tend to read the same data repeatedly. The 
Block Cache is useful for applications that tend to read 
data that is close to the data they recently read (e.g., se¬ 
quential reads, or random reads of different columns in 
the same locality group within a hot row). 

Bloom filters 

As described in Section 5.3, a read operation has to read 
from all SSTables that make up the state of a tablet. 
If these SSTables are not in memory, we may end up 
doing many disk accesses. We reduce the number of 
accesses by allowing clients to specify that Bloom fil¬ 
ters [7] should be created for SSTables in a particu¬ 
lar locality group. A Bloom filter allows us to ask 
whether an SSTable might contain any data for a spec¬ 
ified row/column pair. For certain applications, a small 
amount of tablet server memory used for storing Bloom 
filters drastically reduces the number of disk seeks re¬ 
quired for read operations. Our use of Bloom filters 
also implies that most lookups for non-existent rows or 
columns do not need to touch disk. 

Commit-log implementation 

If we kept the commit log for each tablet in a separate 
log file, a very large number of files would be written 
concurrently in GFS. Depending on the underlying file 
system implementation on each GFS server, these writes 
could cause a large number of disk seeks to write to the 
different physical log files. In addition, having separate 
log files per tablet also reduces the effectiveness of the 
group commit optimization, since groups would tend to 
be smaller. To fix these issues, we append mutations 
to a single commit log per tablet server, co-mingling 
mutations for different tablets in the same physical log 
file [18, 20], 

Using one log provides significant performance ben¬ 
efits during normal operation, but it complicates recov¬ 
ery. When a tablet server dies, the tablets that it served 
will be moved to a large number of other tablet servers: 
each server typically loads a small number of the orig¬ 
inal server’s tablets. To recover the state for a tablet, 
the new tablet server needs to reapply the mutations for 
that tablet from the commit log written by the original 
tablet server. However, the mutations for these tablets 
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were co-mingled in the same physical log file. One ap¬ 
proach would be for each new tablet server to read this 
full commit log file and apply just the entries needed for 
the tablets it needs to recover. However, under such a 
scheme, if 100 machines were each assigned a single 
tablet from a failed tablet server, then the log file would 
be read 100 times (once by each server). 

We avoid duplicating log reads by first sort¬ 
ing the commit log entries in order of the keys 
(table, row name, log sequence number). In the 
sorted output, all mutations for a particular tablet are 
contiguous and can therefore be read efficiently with one 
disk seek followed by a sequential read. To parallelize 
the sorting, we partition the log file into 64 MB seg¬ 
ments, and sort each segment in parallel on different 
tablet servers. This sorting process is coordinated by the 
master and is initiated when a tablet server indicates that 
it needs to recover mutations from some commit log file. 

Writing commit logs to GFS sometimes causes perfor¬ 
mance hiccups for a variety of reasons (e.g., a GFS server 
machine involved in the write crashes, or the network 
paths traversed to reach the particular set of three GFS 
servers is suffering network congestion, or is heavily 
loaded). To protect mutations from GFS latency spikes, 
each tablet server actually has two log writing threads, 
each writing to its own log file; only one of these two 
threads is actively in use at a time. If writes to the ac¬ 
tive log file are performing poorly, the log file writing is 
switched to the other thread, and mutations that are in 
the commit log queue are written by the newly active log 
writing thread. Log entries contain sequence numbers 
to allow the recovery process to elide duplicated entries 
resulting from this log switching process. 

Speeding up tablet recovery 

If the master moves a tablet from one tablet server to 
another, the source tablet server first does a minor com¬ 
paction on that tablet. This compaction reduces recov¬ 
ery time by reducing the amount of uncompacted state in 
the tablet server’s commit log. After finishing this com¬ 
paction, the tablet server stops serving the tablet. Before 
it actually unloads the tablet, the tablet server does an¬ 
other (usually very fast) minor compaction to eliminate 
any remaining uncompacted state in the tablet server’s 
log that arrived while the first minor compaction was 
being performed. After this second minor compaction 
is complete, the tablet can be loaded on another tablet 
server without requiring any recovery of log entries. 

Exploiting immutability 

Besides the SSTable caches, various other parts of the 
Bigtable system have been simplified by the fact that all 


of the SSTables that we generate are immutable. For ex¬ 
ample, we do not need any synchronization of accesses 
to the file system when reading from SSTables. As a re¬ 
sult, concurrency control over rows can be implemented 
very efficiently. The only mutable data structure that is 
accessed by both reads and writes is the memtable. To re¬ 
duce contention during reads of the memtable, we make 
each memtable row copy-on-write and allow reads and 
writes to proceed in parallel. 

Since SSTables are immutable, the problem of perma¬ 
nently removing deleted data is transformed to garbage 
collecting obsolete SSTables. Each tablet’s SSTables are 
registered in the METADATA table. The master removes 
obsolete SSTables as a mark-and-sweep garbage collec¬ 
tion [25] over the set of SSTables, where the METADATA 
table contains the set of roots. 

Finally, the immutability of SSTables enables us to 
split tablets quickly. Instead of generating a new set of 
SSTables for each child tablet, we let the child tablets 
share the SSTables of the parent tablet. 

7 Performance Evaluation 

We set up a Bigtable cluster with N tablet servers to 
measure the performance and scalability of Bigtable as 
N is varied. The tablet servers were configured to use 1 
GB of memory and to write to a GFS cell consisting of 
1786 machines with two 400 GB IDE hard drives each. 
N client machines generated the Bigtable load used for 
these tests. (We used the same number of clients as tablet 
servers to ensure that clients were never a bottleneck.) 
Each machine had two dual-core Opteron 2 GHz chips, 
enough physical memory to hold the working set of all 
running processes, and a single gigabit Ethernet link. 
The machines were arranged in a two-level tree-shaped 
switched network with approximately 100-200 Gbps of 
aggregate bandwidth available at the root. All of the ma¬ 
chines were in the same hosting facility and therefore the 
round-trip time between any pair of machines was less 
than a millisecond. 

The tablet servers and master, test clients, and GFS 
servers all ran on the same set of machines. Every ma¬ 
chine ran a GFS server. Some of the machines also ran 
either a tablet server, or a client process, or processes 
from other jobs that were using the pool at the same time 
as these experiments. 

R is the distinct number of Bigtable row keys involved 
in the test. R was chosen so that each benchmark read or 
wrote approximately 1 GB of data per tablet server. 

The sequential write benchmark used row keys with 
names 0 to R — 1. This space of row keys was parti¬ 
tioned into 10W equal-sized ranges. These ranges were 
assigned to the N clients by a central scheduler that as- 
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Experiment 

# of Tablet Servers i 

1 

50 

250 

500 

random reads 

1212 

593 

479 

241 

random reads (mem) 

10811 

8511 

8000 

6250 

random writes 

8850 

3745 

3425 

2000 

sequential reads 

4425 

2463 

2625 

2469 

sequential writes 

8547 

3623 

2451 

1905 

scans 

15385 

10526 

9524 

7843 


Figure 6: Number of 1000-byte values read/written per second, 
aggregate rate. 

signed the next available range to a client as soon as the 
client finished processing the previous range assigned to 
it. This dynamic assignment helped mitigate the effects 
of performance variations caused by other processes run¬ 
ning on the client machines. We wrote a single string un¬ 
der each row key. Each string was generated randomly 
and was therefore uncompressible. In addition, strings 
under different row key were distinct, so no cross-row 
compression was possible. The random write benchmark 
was similar except that the row key was hashed modulo 
R immediately before writing so that the write load was 
spread roughly uniformly across the entire row space for 
the entire duration of the benchmark. 

The sequential read benchmark generated row keys in 
exactly the same way as the sequential write benchmark, 
but instead of writing under the row key, it read the string 
stored under the row key (which was written by an earlier 
invocation of the sequential write benchmark). Similarly, 
the random read benchmark shadowed the operation of 
the random write benchmark. 

The scan benchmark is similar to the sequential read 
benchmark, but uses support provided by the Bigtable 
API for scanning over all values in a row range. Us¬ 
ing a scan reduces the number of RPCs executed by the 
benchmark since a single RPC fetches a large sequence 
of values from a tablet server. 

The random reads (mem) benchmark is similar to the 
random read benchmark, but the locality group that con¬ 
tains the benchmark data is marked as in-memory , and 
therefore the reads are satisfied from the tablet server’s 
memory instead of requiring a GFS read. For just this 
benchmark, we reduced the amount of data per tablet 
server from 1 GB to 100 MB so that it would fit com¬ 
fortably in the memory available to the tablet server. 

Figure 6 shows two views on the performance of our 
benchmarks when reading and writing 1000-byte values 
to Bigtable. The table shows the number of operations 
per second per tablet server; the graph shows the aggre¬ 
gate number of operations per second. 



The table shows the rate per tablet server; the graph shows the 


Single tablet-server performance 

Fet us first consider performance with just one tablet 
server. Random reads are slower than all other operations 
by an order of magnitude or more. Each random read in¬ 
volves the transfer of a 64 KB SSTable block over the 
network from GFS to a tablet server, out of which only a 
single 1000-byte value is used. The tablet server executes 
approximately 1200 reads per second, which translates 
into approximately 75 MB/s of data read from GFS. This 
bandwidth is enough to saturate the tablet server CPUs 
because of overheads in our networking stack, SSTable 
parsing, and Bigtable code, and is also almost enough 
to saturate the network links used in our system. Most 
Bigtable applications with this type of an access pattern 
reduce the block size to a smaller value, typically 8KB. 

Random reads from memory are much faster since 
each 1000-byte read is satisfied from the tablet server’s 
local memory without fetching a large 64 KB block from 
GFS. 

Random and sequential writes perform better than ran¬ 
dom reads since each tablet server appends all incoming 
writes to a single commit log and uses group commit to 
stream these writes efficiently to GFS. There is no sig¬ 
nificant difference between the performance of random 
writes and sequential writes; in both cases, all writes to 
the tablet server are recorded in the same commit log. 

Sequential reads perform better than random reads 
since every 64 KB SSTable block that is fetched from 
GFS is stored into our block cache, where it is used to 
serve the next 64 read requests. 

Scans are even faster since the tablet server can return 
a large number of values in response to a single client 
RPC, and therefore RPC overhead is amortized over a 
large number of values. 

Scaling 

Aggregate throughput increases dramatically, by over a 
factor of a hundred, as we increase the number of tablet 
servers in the system from 1 to 500. For example, the 
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# of tablet servers 

# of clusters 

0 .. 

19 

259 

20 .. 

49 

47 

50 .. 

99 

20 

100 .. 

499 

50 

> 500 


12 


Table 1: Distribution of number of tablet servers in Bigtable 
clusters. 

performance of random reads from memory increases by 
almost a factor of 300 as the number of tablet server in¬ 
creases by a factor of 500. This behavior occurs because 
the bottleneck on performance for this benchmark is the 
individual tablet server CPU. 

However, performance does not increase linearly. For 
most benchmarks, there is a significant drop in per-server 
throughput when going from 1 to 50 tablet servers. This 
drop is caused by imbalance in load in multiple server 
configurations, often due to other processes contending 
for CPU and network. Our load balancing algorithm at¬ 
tempts to deal with this imbalance, but cannot do a per¬ 
fect job for two main reasons: rebalancing is throttled to 
reduce the number of tablet movements (a tablet is un¬ 
available for a short time, typically less than one second, 
when it is moved), and the load generated by our bench¬ 
marks shifts around as the benchmark progresses. 

The random read benchmark shows the worst scaling 
(an increase in aggregate throughput by only a factor of 
100 for a 500-fold increase in number of servers). This 
behavior occurs because (as explained above) we transfer 
one large 64KB block over the network for every 1000- 
byte read. This transfer saturates various shared 1 Gi¬ 
gabit links in our network and as a result, the per-server 
throughput drops significantly as we increase the number 
of machines. 

8 Real Applications 

As of August 2006, there are 388 non-test Bigtable clus¬ 
ters running in various Google machine clusters, with a 
combined total of about 24,500 tablet servers. Table 1 
shows a rough distribution of tablet servers per cluster. 
Many of these clusters are used for development pur¬ 
poses and therefore are idle for significant periods. One 
group of 14 busy clusters with 8069 total tablet servers 
saw an aggregate volume of more than 1.2 million re¬ 
quests per second, with incoming RPC traffic of about 
741 MB/s and outgoing RPC traffic of about 16 GB/s. 

Table 2 provides some data about a few of the tables 
currently in use. Some tables store data that is served 
to users, whereas others store data for batch processing; 
the tables range widely in total size, average cell size. 


percentage of data served from memory, and complexity 
of the table schema. In the rest of this section, we briefly 
describe how three product teams use Bigtable. 

8.1 Google Analytics 

Google Analytics (analytics.google.com) is a service 
that helps webmasters analyze traffic patterns at their 
web sites. It provides aggregate statistics, such as the 
number of unique visitors per day and the page views 
per URL per day, as well as site-tracking reports, such as 
the percentage of users that made a purchase, given that 
they earlier viewed a specific page. 

To enable the service, webmasters embed a small 
JavaScript program in their web pages. This program 
is invoked whenever a page is visited. It records various 
information about the request in Google Analytics, such 
as a user identifier and information about the page be¬ 
ing fetched. Google Analytics summarizes this data and 
makes it available to webmasters. 

We briefly describe two of the tables used by Google 
Analytics. The raw click table ('200 TB) maintains a 
row for each end-user session. The row name is a tuple 
containing the website’s name and the time at which the 
session was created. This schema ensures that sessions 
that visit the same web site are contiguous, and that they 
are sorted chronologically. This table compresses to 14% 
of its original size. 

The summary table ('20 TB) contains various prede¬ 
fined summaries for each website. This table is gener¬ 
ated from the raw click table by periodically scheduled 
MapReduce jobs. Each MapReduce job extracts recent 
session data from the raw click table. The overall sys¬ 
tem’s throughput is limited by the throughput of GFS. 
This table compresses to 29% of its original size. 

8.2 Google Earth 

Google operates a collection of services that provide 
users with access to high-resolution satellite imagery of 
the world’s surface, both through the web-based Google 
Maps interface (maps.google.com) and through the 
Google Earth (earth.google.com) custom client soft¬ 
ware. These products allow users to navigate across the 
world’s surface: they can pan, view, and annotate satel¬ 
lite imagery at many different levels of resolution. This 
system uses one table to preprocess data, and a different 
set of tables for serving client data. 

The preprocessing pipeline uses one table to store raw 
imagery. During preprocessing, the imagery is cleaned 
and consolidated into final serving data. This table con¬ 
tains approximately 70 terabytes of data and therefore is 
served from disk. The images are efficiently compressed 
already, so Bigtable compression is disabled. 
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Project 

name 

Table size 
(TB) 

Compression 

ratio 

# Cells 
(billions) 

# Column 
Families 

# Locality 
Groups 

% in 

memory 

Latency- 

sensitive? 

Crawl 

800 

11% 

1000 

16 

8 

0% 

No 

Crawl 

50 

33% 

200 

2 

2 

0% 

No 

Google Analytics 

20 

29% 

10 

1 

1 

0% 

Yes 

Google Analytics 

200 

14% 

80 

1 

1 

0% 

Yes 

Google Base 

2 

31% 

10 

29 

3 

15% 

Yes 

Google Earth 

0.5 

64% 

8 

7 

2 

33% 

Yes 

Google Earth 

70 

- 

9 

8 

3 

0% 

No 

Orkut 

9 

- 

0.9 

8 

5 

1% 

Yes 

Personalized Search 

4 

47% 

6 

93 

11 

5% 

Yes 


Table 2: Characteristics of a few tables in production use. Table size (measured before compression) and # Cells indicate approxi¬ 
mate sizes. Compression ratio is not given for tables that have compression disabled. 


Each row in the imagery table corresponds to a sin¬ 
gle geographic segment. Rows are named to ensure that 
adjacent geographic segments are stored near each other. 
The table contains a column family to keep track of the 
sources of data for each segment. This column family 
has a large number of columns: essentially one for each 
raw data image. Since each segment is only built from a 
few images, this column family is very sparse. 

The preprocessing pipeline relies heavily on MapRe¬ 
duce over Bigtable to transform data. The overall system 
processes over 1 MB/sec of data per tablet server during 
some of these MapReduce jobs. 

The serving system uses one table to index data stored 
in GFS. This table is relatively small ('500 GB), but it 
must serve tens of thousands of queries per second per 
datacenter with low latency. As a result, this table is 
hosted across hundreds of tablet servers and contains in¬ 
memory column families. 

8.3 Personalized Search 

Personalized Search (www.google.com/psearch) is an 
opt-in service that records user queries and clicks across 
a variety of Google properties such as web search, im¬ 
ages, and news. Users can browse their search histories 
to revisit their old queries and clicks, and they can ask 
for personalized search results based on their historical 
Google usage patterns. 

Personalized Search stores each user’s data in 
Bigtable. Each user has a unique userid and is assigned 
a row named by that userid. All user actions are stored 
in a table. A separate column family is reserved for each 
type of action (for example, there is a column family that 
stores all web queries). Each data element uses as its 
Bigtable timestamp the time at which the corresponding 
user action occurred. Personalized Search generates user 
profiles using a MapReduce over Bigtable. These user 
profiles are used to personalize live search results. 


The Personalized Search data is replicated across sev¬ 
eral Bigtable clusters to increase availability and to re¬ 
duce latency due to distance from clients. The Personal¬ 
ized Search team originally built a client-side replication 
mechanism on top of Bigtable that ensured eventual con¬ 
sistency of all replicas. The current system now uses a 
replication subsystem that is built into the servers. 

The design of the Personalized Search storage system 
allows other groups to add new per-user information in 
their own columns, and the system is now used by many 
other Google properties that need to store per-user con¬ 
figuration options and settings. Sharing a table amongst 
many groups resulted in an unusually large number of 
column families. To help support sharing, we added a 
simple quota mechanism to Bigtable to limit the stor¬ 
age consumption by any particular client in shared ta¬ 
bles; this mechanism provides some isolation between 
the various product groups using this system for per-user 
information storage. 

9 Lessons 

In the process of designing, implementing, maintaining, 
and supporting Bigtable, we gained useful experience 
and learned several interesting lessons. 

One lesson we learned is that large distributed sys¬ 
tems are vulnerable to many types of failures, not just 
the standard network partitions and fail-stop failures as¬ 
sumed in many distributed protocols. For example, we 
have seen problems due to all of the following causes: 
memory and network corruption, large clock skew, hung 
machines, extended and asymmetric network partitions, 
bugs in other systems that we are using (Chubby for ex¬ 
ample), overflow of GFS quotas, and planned and un¬ 
planned hardware maintenance. As we have gained more 
experience with these problems, we have addressed them 
by changing various protocols. For example, we added 
checksumming to our RPC mechanism. We also handled 


To appear in OSDI2006 


11 



















































































some problems by removing assumptions made by one 
part of the system about another part. For example, we 
stopped assuming a given Chubby operation could return 
only one of a fixed set of errors. 

Another lesson we learned is that it is important to 
delay adding new features until it is clear how the new 
features will be used. For example, we initially planned 
to support general-purpose transactions in our API. Be¬ 
cause we did not have an immediate use for them, how¬ 
ever, we did not implement them. Now that we have 
many real applications running on Bigtable, we have 
been able to examine their actual needs, and have discov¬ 
ered that most applications require only single-row trans¬ 
actions. Where people have requested distributed trans¬ 
actions, the most important use is for maintaining sec¬ 
ondary indices, and we plan to add a specialized mech¬ 
anism to satisfy this need. The new mechanism will 
be less general than distributed transactions, but will be 
more efficient (especially for updates that span hundreds 
of rows or more) and will also interact better with our 
scheme for optimistic cross-data-center replication. 

A practical lesson that we learned from supporting 
Bigtable is the importance of proper system-level mon¬ 
itoring (i.e., monitoring both Bigtable itself, as well as 
the client processes using Bigtable). For example, we ex¬ 
tended our RPC system so that for a sample of the RPCs, 
it keeps a detailed trace of the important actions done on 
behalf of that RPC. This feature has allowed us to de¬ 
tect and fix many problems such as lock contention on 
tablet data structures, slow writes to GFS while com¬ 
mitting Bigtable mutations, and stuck accesses to the 
METADATA table when METADATA tablets are unavail¬ 
able. Another example of useful monitoring is that ev¬ 
ery Bigtable cluster is registered in Chubby. This allows 
us to track down all clusters, discover how big they are, 
see which versions of our software they are running, how 
much traffic they are receiving, and whether or not there 
are any problems such as unexpectedly large latencies. 

The most important lesson we learned is the value 
of simple designs. Given both the size of our system 
(about 100,000 lines of non-test code), as well as the 
fact that code evolves over time in unexpected ways, we 
have found that code and design clarity are of immense 
help in code maintenance and debugging. One exam¬ 
ple of this is our tablet-server membership protocol. Our 
first protocol was simple: the master periodically issued 
leases to tablet servers, and tablet servers killed them¬ 
selves if their lease expired. Unfortunately, this proto¬ 
col reduced availability significantly in the presence of 
network problems, and was also sensitive to master re¬ 
covery time. We redesigned the protocol several times 
until we had a protocol that performed well. However, 
the resulting protocol was too complex and depended on 


the behavior of Chubby features that were seldom exer¬ 
cised by other applications. We discovered that we were 
spending an inordinate amount of time debugging ob¬ 
scure corner cases, not only in Bigtable code, but also in 
Chubby code. Eventually, we scrapped this protocol and 
moved to a newer simpler protocol that depends solely 
on widely-used Chubby features. 

10 Related Work 

The Boxwood project [24] has components that overlap 
in some ways with Chubby, GFS, and Bigtable, since it 
provides for distributed agreement, locking, distributed 
chunk storage, and distributed B-tree storage. In each 
case where there is overlap, it appears that the Box¬ 
wood’s component is targeted at a somewhat lower level 
than the corresponding Google service. The Boxwood 
project’s goal is to provide infrastructure for building 
higher-level services such as file systems or databases, 
while the goal of Bigtable is to directly support client 
applications that wish to store data. 

Many recent projects have tackled the problem of pro¬ 
viding distributed storage or higher-level services over 
wide area networks, often at “Internet scale.” This in¬ 
cludes work on distributed hash tables that began with 
projects such as CAN [29], Chord [32], Tapestry [37], 
and Pastry [30]. These systems address concerns that do 
not arise for Bigtable, such as highly variable bandwidth, 
untrusted participants, or frequent reconfiguration; de¬ 
centralized control and Byzantine fault tolerance are not 
Bigtable goals. 

In terms of the distributed data storage model that one 
might provide to application developers, we believe the 
key-value pair model provided by distributed B-trees or 
distributed hash tables is too limiting. Key-value pairs 
are a useful building block, but they should not be the 
only building block one provides to developers. The 
model we chose is richer than simple key-value pairs, 
and supports sparse semi-structured data. Nonetheless, 
it is still simple enough that it lends itself to a very effi¬ 
cient flat-file representation, and it is transparent enough 
(via locality groups) to allow our users to tune important 
behaviors of the system. 

Several database vendors have developed parallel 
databases that can store large volumes of data. Oracle’s 
Real Application Cluster database [27] uses shared disks 
to store data (Bigtable uses GFS) and a distributed lock 
manager (Bigtable uses Chubby). IBM’s DB2 Parallel 
Edition [4] is based on a shared-nothing [33] architecture 
similar to Bigtable. Each DB2 server is responsible for 
a subset of the rows in a table which it stores in a local 
relational database. Both products provide a complete 
relational model with transactions. 
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Bigtable locality groups realize similar compression 
and disk read performance benefits observed for other 
systems that organize data on disk using column-based 
rather than row-based storage, including C-Store [1, 34] 
and commercial products such as Sybase IQ [15, 36], 
SenSage [31], KDB+ [22], and the ColumnBM storage 
layer in MonetDB/XlOO [38], Another system that does 
vertical and horizontal data pardoning into flat files and 
achieves good data compression ratios is AT&T’s Day¬ 
tona database [19], Locality groups do not support CPU- 
cache-level optimizations, such as those described by 
Ailamaki [2], 

The manner in which Bigtable uses memtables and 
SSTables to store updates to tablets is analogous to the 
way that the Log-Structured Merge Tree [26] stores up¬ 
dates to index data. In both systems, sorted data is 
buffered in memory before being written to disk, and 
reads must merge data from memory and disk. 

C-Store and Bigtable share many characteristics: both 
systems use a shared-nothing architecture and have two 
different data structures, one for recent writes, and one 
for storing long-lived data, with a mechanism for mov¬ 
ing data from one form to the other. The systems dif¬ 
fer significantly in their API: C-Store behaves like a 
relational database, whereas Bigtable provides a lower 
level read and write interface and is designed to support 
many thousands of such operations per second per server. 
C-Store is also a “read-optimized relational DBMS”, 
whereas Bigtable provides good performance on both 
read-intensive and write-intensive applications. 

Bigtable’s load balancer has to solve some of the same 
kinds of load and memory balancing problems faced by 
shared-nothing databases (e.g., [11, 35]). Our problem is 
somewhat simpler: (1) we do not consider the possibility 
of multiple copies of the same data, possibly in alternate 
forms due to views or indices; (2) we let the user tell us 
what data belongs in memory and what data should stay 
on disk, rather than trying to determine this dynamically; 
(3) we have no complex queries to execute or optimize. 

11 Conclusions 

We have described Bigtable, a distributed system for 
storing structured data at Google. Bigtable clusters have 
been in production use since April 2005, and we spent 
roughly seven person-years on design and implementa¬ 
tion before that date. As of August 2006, more than sixty 
projects are using Bigtable. Our users like the perfor¬ 
mance and high availability provided by the Bigtable im¬ 
plementation, and that they can scale the capacity of their 
clusters by simply adding more machines to the system 
as their resource demands change over time. 


Given the unusual interface to Bigtable, an interest¬ 
ing question is how difficult it has been for our users to 
adapt to using it. New users are sometimes uncertain of 
how to best use the Bigtable interface, particularly if they 
are accustomed to using relational databases that support 
general-purpose transactions. Nevertheless, the fact that 
many Google products successfully use Bigtable demon¬ 
strates that our design works well in practice. 

We are in the process of implementing several addi¬ 
tional Bigtable features, such as support for secondary 
indices and infrastructure for building cross-data-center 
replicated Bigtables with multiple master replicas. We 
have also begun deploying Bigtable as a service to prod¬ 
uct groups, so that individual groups do not need to main¬ 
tain their own clusters. As our service clusters scale, 
we will need to deal with more resource-sharing issues 
within Bigtable itself [3, 5], 

Finally, we have found that there are significant ad¬ 
vantages to building our own storage solution at Google. 
We have gotten a substantial amount of flexibility from 
designing our own data model for Bigtable. In addi¬ 
tion, our control over Bigtable’s implementation, and 
the other Google infrastructure upon which Bigtable de¬ 
pends, means that we can remove bottlenecks and ineffi¬ 
ciencies as they arise. 
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